ABSTRACT

In this paper we consider the problem of anonymizing datasets 
in which each individual is associated with a set of items 
that constitute private information about the individual. Illustrative
datasets include market-basket datasets and search
engine query logs. We formalize the notion of k-anonymity
for set-valued data as a variant of the k-anonymity model for
traditional relational datasets. We define an optimization
problem that arises from this definition of anonymity and
provide O(k log k)- and O(1)-approximation algorithms for
the same. We demonstrate applicability of our algorithms
to the America Online query log dataset.

1. INTRODUCTION

Consider a dataset containing detailed information about
the private actions of individuals, e.g., a market-basket data- 
set or a dataset of search engine query logs. Market-basket 
datasets contain information about items bought by individ- 
uals and search engine query logs contain detailed informa- 
tion about the queries posed by users and the results that 
were clicked on. There is often a need to publish such data 
for research purposes. Market-basket data, for instance, 
could be used for association rule mining and for the design 
and testing of recommendation systems. Query logs could 
be used to study patterns of query refinement, develop algo- 
rithms for query suggestion and improve the overall quality 
of search. 

The publication of such data, however, poses a challenge 
as far as the privacy of individual users is concerned. Even 
after removing all personal characteristics of individuals such 
as actual usernames and IP addresses, the publication of such
data is still subject to privacy attacks from attackers with 
partial knowledge of the private actions of individuals. Our 
work in this paper is motivated by two such recent data 
releases and privacy attacks on them. 

In August of 2006, America Online (AOL) released a large
portion of its search engine query logs for research purposes.
The dataset contained 20 million queries posed by
650,000 AOL users over a 3-month period. Before releasing
the data, AOL ran a simplistic anonymization procedure
wherein every username was replaced by a random identifier.
Despite this basic protective measure, the New York
Times [6] demonstrated how the queries themselves could
essentially reveal the identities of users. For example, user
4417749 revealed herself to be a resident of Lilburn, in
Gwinnett County, GA, by querying for businesses and services in
the area. She further revealed her last name by querying
for relatives. There were only 14 citizens with her last name
in Gwinnett County, and the user was quickly revealed to
be Thelma Arnold, a 62-year-old woman living in Georgia.
From this point on, researchers at the New York Times could
look at all of the queries posed by Ms. Arnold over the 3-month
period. The publication of the query log data thus
constituted a very serious privacy breach.

In October of 2006, Netflix announced the $1-million Netflix
Prize for improving their movie recommendation system.
As a part of the contest Netflix publicly released a dataset
containing 100 million movie ratings created by 500,000
Netflix subscribers over a period of 6 years. Once again, a
simplistic anonymization procedure of replacing usernames
with random identifiers was used prior to the release.
Nevertheless, it was shown that 84% of the subscribers could
be uniquely identified by an attacker who knew 6 out of
8 movies that the subscriber had rated outside of the top
500 [19].

The commonality between the AOL and Netflix datasets 
is that each individual's data is essentially a set of items. 
Further this set of items is both identifying of the individ- 
ual as well as private information about the individual, and 
partial knowledge of this set of items is used in the privacy 
attack. In the case of the Netflix data (representative of 
market-basket data), for instance, it is the set of movies 
that a subscriber rated, and in the case of the AOL data, it 
is the set of queries that a user posed, also called the user 
session. 

Motivated by these examples, as well as by the very real 
need for releasing such datasets for research purposes, we 
propose a notion of anonymity for set-valued data in this 
paper. Informally, a dataset is said to be k-anonymous if
every individual's "set of items" is identical to those of at
least k - 1 other individuals. So a user in the Netflix dataset
would be k-anonymous if at least k - 1 other users rated
exactly the same set of movies; a user in the AOL query
logs would be k-anonymous if at least k - 1 other users
posed exactly the same set of queries.

One simple way to achieve k-anonymity for a dataset
would be to remove every item from every user's
set, or to add every item from the universe of items to every
single set. Naturally this would radically distort the
dataset, rendering it useless for analyses. So instead, to provide
greater utility than such a simplistic scheme, we seek to
make the minimal number of changes possible to the dataset
in order to achieve the anonymity requirements. We provide
O(k log k)- and O(1)-approximation algorithms for this
optimization problem. Further, we demonstrate how these
algorithms can be scaled for application to massive modern-day
datasets such as the AOL query logs. To summarize our
contributions:

• We define the notion of k-anonymity for set-valued
data and introduce an optimization problem for minimally
achieving k-anonymity in Section 3.

• We provide algorithms with approximation factors of
O(k log k) and O(1) for the optimization problem in
Section 4.

• In Section 5, we demonstrate how our algorithms can
be scaled for application to massive datasets and experiment
on the AOL logs.

Before proceeding further, note that illustrative datasets 
used as motivating examples above also contain further user 
information: time stamp information for when a rating was 
given and the actual rating itself in the Netflix data; time 
stamp information for when a query was posed and the query 
result that was clicked on in the AOL data. However for the 
purposes of this paper, we ignore these other attributes of 
the dataset and discuss how they could potentially be dealt 
with in Section 5.5. Indeed the privacy attacks mentioned 
above did not involve knowledge of these other attributes, 
and therefore the anonymization problem on even just the 
reduced set of attributes is important to study. 

We will next briefly review related work, where we distinguish
our problem from the traditional k-anonymity problem
that has been studied for relational datasets.

2. RELATED WORK 

There has been considerable prior work on anonymizing 
traditional relational datasets such as medical records. The 
most widely studied anonymity definitions for such datasets 
are k-anonymity [3, 18, 20, 23, 15] and its variants, l-diversity
[17] and t-closeness [16]. In all these definitions, certain
public attributes of the dataset are initially determined to
be "quasi-identifiers". For instance, in a dataset of medical
records, attributes such as Date-of-Birth, Gender and
Zipcode would qualify as quasi-identifiers since in combination
they can be used to uniquely identify 87% of the U.S.
population [23]. A dataset is then said to be k-anonymous if
every record in the dataset is identical to at least k - 1 other
records on its quasi-identifying attribute values. The idea
is that privacy is achieved if every individual is hidden in a
crowd of size at least k. Anonymization algorithms achieve
the k-anonymity requirement by suppressing and generalizing
the quasi-identifying attribute values of records. A trivial
way to achieve k-anonymity would be to simply suppress
every single attribute value in the dataset, but this would
completely destroy the utility of the dataset. Instead, in
order to preserve utility, the algorithms attempt to achieve
the anonymity requirement with a minimum number of suppressions
and generalizations.

The kinds of datasets that we consider in this paper differ
from traditional relational datasets in two ways. First, each
database record in our scenario essentially corresponds to a
set of items. The database records could thus be of variable
length and high dimensionality. Second, there is no longer a
clear distinction between private attributes and quasi-identifiers.
A user's queries are both private information about
the user and identifying of the user. Similarly,
in the case of market-basket data, the set of items bought by
an individual is private information about the individual
and at the same time can be used to identify the individual.
Our definition of anonymity and anonymization algorithms
are applicable to such set-valued data.

In [24] the authors study the problem of anonymizing 
market-basket data. They propose a notion of anonymity 
similar to k-anonymity where a limit is placed on the number
of private items of any individual that could be known to
an attacker beforehand. The authors provide generalization 
algorithms to achieve the anonymity requirements. For ex- 
ample, an item 'milk' in a user's basket may be generalized 
to 'dairy product' in order to protect it. In contrast, the 
techniques we propose consider additions and deletions to 
the dataset instead of generalizations. Further, we demon- 
strate applicability of our algorithms to search engine query 
log data as well where there is no obvious underlying hier- 
archy that can be used to generalize queries. 

Our O(1)-approximation algorithm is derived by reducing
the anonymization problem to a clustering problem. Clustering
techniques for achieving anonymity have also been
studied in [2]; there, however, the authors seek to minimize
the maximum radius of the clustering, whereas we wish to
minimize the sum of the Hamming distances of points to
their cluster centers.

In [25] the authors propose the notion of (h, k, p)-coherence
for anonymizing transactional data. Here once again there
is a division of items into public and private items. The
goal of the anonymization is to ensure that for any set of
p public items, either no transaction contains this set, or
at least k transactions contain it, and no more than h percent
of these transactions contain a common private item.
The authors consider the minimal number of suppressions
required to achieve these anonymity goals; however, no theoretical
guarantees are given.

Besides the k-anonymization based techniques, there has
also been considerable work on anonymizing datasets by the
addition of noise or perturbation [4, 9, 5]. We do not consider
perturbation-based approaches in this paper.

With regard to search engine query logs, there has been
work on identifying privacy attacks both on users [14] as well 
as on companies whose websites appear in query results and 
get clicked on [21]. We do not consider the latter kind of 
privacy attack in this paper. [14] considers an anonymiza- 
tion procedure wherein keywords in queries are replaced by 
secure hashes. The authors show that such a procedure is 
susceptible to statistical attacks on the hashed keywords, 
leading to privacy breaches. There has also been work on 
defending against privacy attacks on users in [1]. This line of
work considers heuristics such as the removal of infrequent 
queries and develops methods to apply such techniques on 
the fly as new queries are posed. In contrast, we consider a 
static scenario wherein a search engine would like to publicly 
release an existing set of query logs. 

3. DEFINITIONS 

Let $D = \{S_1, \ldots, S_n\}$ be a dataset containing n records.
Each record $S_i$ is a set of items. Formally, $S_i$ is a non-empty
subset of a universe of items, $U = \{e_1, e_2, \ldots, e_m\}$. We can
then define an anonymous dataset as follows.

Definition 1. (k-Anonymity for Set-Valued Data) We say
that D is k-anonymous if every record $S_i \in D$ is identical to
at least k - 1 other records.

Figure 1: 2-Anonymization
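Definition 1 can be checked mechanically by counting identical records. The following is a minimal sketch, assuming records are given as Python sets; the function name `is_k_anonymous` and the toy records are hypothetical and not taken from the paper.

```python
from collections import Counter


def is_k_anonymous(dataset, k):
    """Return True if every record is identical to at least k - 1 other records."""
    # frozenset makes each record hashable so identical records can be counted.
    counts = Counter(frozenset(record) for record in dataset)
    return all(count >= k for count in counts.values())


# Hypothetical toy dataset: two copies of {e1, e2} and three copies of {e3}.
records = [{"e1", "e2"}, {"e2", "e1"}, {"e3"}, {"e3"}, {"e3"}]
print(is_k_anonymous(records, 2))  # True: every record has at least one twin
print(is_k_anonymous(records, 3))  # False: {e1, e2} appears only twice
```

The check runs in time linear in the total size of the records, since it only hashes each record once.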

Given this definition, we can now define an optimization 
problem that asks for the minimum number of transforma- 
tions to be made to a dataset to obtain an anonymized 
dataset. 

Definition 2. (The k-Anonymization Problem for Set-Valued
Data) Given a dataset $D = \{S_1, \ldots, S_n\}$, find the minimum
number of items that need to be added to or deleted
from the sets $S_1, \ldots, S_n$ to ensure that the resulting dataset
D' is k-anonymous.

We illustrate the k-anonymization problem with an example.

Example 1. Consider the dataset in Figure 1(a). The
dataset in Figure 1(b) represents a 2-anonymous transformation
that is obtained by making 2 additions and 1 deletion.
The items $e_3$ and $e_2$ are added to records $S_2$ and
$S_3$ respectively, while one item is deleted from record $S_4$.
The resulting dataset consists of two 2-anonymous groups:
$\{S_1, S_2, S_3\}$ and $\{S_4, S_5\}$.

As a more concrete example, in the case of market-basket 
data, the dataset consists of records, where each record 
is a basket of items purchased by an individual. The k-anonymization
problem then is to add or delete items to or from
individuals' baskets so that every basket is identical to at
least k - 1 other baskets.

In the case of search engine query logs, the records correspond
to user sessions. Instead of treating each user session
as a set of queries, we consider a relaxed problem and
treat each user session as a set of query terms or keywords.
See Section 5 for the details. The k-anonymization problem
then becomes one of adding or deleting keywords to or from
user sessions to ensure that each user session becomes identical
to at least k - 1 other user sessions. Since no two user
sessions are likely to be similar on all the queries, we consider
a slightly modified problem in our experiments. Each
user session is first separated into "topic-based" threads, and
our goal becomes one of anonymizing these threads instead
of the original sessions. The result is an increase in the utility
of the released dataset. Again, Section 5 elaborates on
the details.

More generally, the dataset can be thought of as a bipartite
graph, with sets (user sessions/baskets/individuals) represented
as nodes on the left hand side and items of the universe
(keywords searched for/items purchased/movies rated)
as nodes on the right hand side. The k-anonymization problem
then is to add or delete edges in the bipartite graph so
that every node on the left hand side is identical to (that is,
has the same neighbors as) at least k - 1 other nodes.

Figure 2: Dataset from Figure 1(a) as a relational
dataset

Depending on the application, it may make sense to restrict
the set of permissible operations to only additions or
only deletions; in this paper, however, we consider the most
general version of the problem that permits both.

4. APPROXIMATION ALGORITHMS 

Given these definitions, we are now ready to devise algorithms
for optimally achieving k-anonymity. We first draw
connections between the k-anonymization problem for set-valued
data and other optimization problems that have previously
been studied in the literature, namely, the suppression-based
k-anonymization problem for relational data and the
load-balanced facility location problem. The reductions to
these problems automatically give us the approximation algorithms
we desire. In what follows we do not describe the
algorithms themselves, only the reductions. The algorithms
can be found in [18, 3, 20, 10, 13, 22].

A natural question that arises is whether traditional k-anonymity
algorithms that involve suppressions and generalizations
can be used for the k-anonymization problem for
set-valued data as defined in Section 3. To this end, we first
translate the set-valued dataset to a traditional relational
dataset.

Transforming D to $R_D$

A dataset $D = \{S_1, \ldots, S_n\}$ can be transformed to a traditional
relational dataset $R_D$ by creating a binary attribute
for every item $e_j$ in the universe and a tuple for every set
$S_i$. Each tuple will then be a vector in $\{0,1\}^m$. The 1's
correspond to items in the universe that a set contains and
the 0's correspond to those that it does not^1. For example,
the dataset from Figure 1(a) translates to the dataset
in Figure 2.
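The transformation can be sketched in a few lines. This is an illustrative example only, assuming a fixed ordering of the universe; the names `to_bit_vectors` and `hamming` and the toy data are hypothetical. It also shows that the Hamming distance between two tuples equals the size of the symmetric difference of the underlying sets, which is what links flips to additions and deletions.

```python
def to_bit_vectors(dataset, universe):
    """Map each set to a 0/1 tuple with one position per item of the universe."""
    return [tuple(1 if item in record else 0 for item in universe)
            for record in dataset]


def hamming(u, v):
    """Number of positions in which two equal-length tuples differ."""
    return sum(a != b for a, b in zip(u, v))


# Hypothetical universe and dataset (not the data of Figure 1).
universe = ["e1", "e2", "e3", "e4"]
dataset = [{"e1", "e2"}, {"e2", "e3", "e4"}]

rows = to_bit_vectors(dataset, universe)
print(rows)  # [(1, 1, 0, 0), (0, 1, 1, 1)]

# Flipping bits until two rows agree costs as much as adding/deleting items
# until the two sets agree.
print(hamming(rows[0], rows[1]) == len(dataset[0] ^ dataset[1]))  # True
```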

The k-anonymization problem over D now translates to
the following problem over $R_D$:

Definition 3. (k-Anonymization via Flips) Given a dataset
$R_D$ over a binary alphabet {0, 1}, flip as few 0's to 1's and
1's to 0's in $R_D$ as possible so that every tuple is identical
to at least k - 1 other tuples.

It is trivial to see that there is a one-to-one correspondence
between feasible solutions for the k-anonymization problem
over D and the flip-based k-anonymization problem over
$R_D$.

^1 Note that at no point do our approximation algorithms ever
explicitly construct these bit vectors. Rather they operate
directly on the set representations of the tuples, computing
intersections of pairs of sets. The algorithms therefore
scale with the maximum set size rather than m. The bit
vector representations have only been used here for ease of
exposition.

Proposition 1. Any feasible solution, $S_{flip}$, to the flip-based
k-anonymization problem over $R_D$ can be converted to
a feasible solution, $S_\pm$, of the same cost for the k-anonymization
problem over D and vice versa.

Proof Sketch. For every 0 that is flipped to a 1 in $S_{flip}$,
simply add the corresponding item to the corresponding set
in $S_\pm$, and for every 1 that is flipped to a 0, delete the item
from the set.

Now the flip-based k-anonymization problem can be solved
using suppression-based k-anonymization techniques for traditional
relational datasets studied in [18, 3, 20]. The problem
studied there essentially boils down to the following.

Definition 4. (k-Anonymization via Suppressions) Given
a dataset $R_D$ over a binary alphabet {0, 1}, what is the
minimum number of 0's and 1's in $R_D$ that need to be
converted to *'s to ensure that every tuple is identical to
at least k - 1 other tuples?

Now it is easy to see that the following holds.

Proposition 2. Any feasible solution $S_*$ to the suppression-based
k-anonymization problem can be converted to a
flip-based solution $S_{flip}$ using Algorithm 1.



Algorithm 1 Converting S_* to S_flip

    // input: R_D, S_*
    for every k-anonymous group of tuples G in S_* do
        for every column C do
            // C_G = C values for rows in G in R_D
            if number of 1's in C_G > number of 0's then
                flip the 0's in C_G to 1's
            else
                flip the 1's in C_G to 0's
            end if
        end for
    end for

Figure 3: $S_{flip}$ is obtained from $S_*$ via Algorithm 1
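Algorithm 1 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the suppression-based solution is supplied as a list of groups of row indices, and the function name `suppressions_to_flips` and the toy rows are hypothetical.

```python
def suppressions_to_flips(rows, groups):
    """Apply the majority rule of Algorithm 1 to every group and column.

    rows   -- list of 0/1 lists, the original relational dataset
    groups -- list of lists of row indices, the k-anonymous groups of the
              suppression-based solution (assumed input format)
    Returns the flipped rows and the total number of flips.
    """
    new_rows = [list(row) for row in rows]
    flips = 0
    n_cols = len(rows[0]) if rows else 0
    for group in groups:
        for c in range(n_cols):
            ones = sum(rows[i][c] for i in group)
            zeros = len(group) - ones
            # Strict majority of 1's -> set the column to 1, otherwise to 0.
            target = 1 if ones > zeros else 0
            for i in group:
                if new_rows[i][c] != target:
                    new_rows[i][c] = target
                    flips += 1
    return new_rows, flips


# Hypothetical 4-row dataset with two groups of size 2.
rows = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
groups = [[0, 1], [2, 3]]
flipped, cost = suppressions_to_flips(rows, groups)
print(flipped)  # [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]]
print(cost)     # 2
```

Columns on which a group already agrees need no flips, so only the columns that the suppression-based solution starred out contribute to the cost.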



The algorithm essentially takes every k-anonymous group
of tuples in $S_*$. Then for any column in the group that is
suppressed (*ed out), it replaces the column for that group
entirely with 1's or entirely with 0's depending on which
action would involve fewer flips in the original
dataset $R_D$.

Example 2. Figure 3 shows an example of an original dataset,
a 2-anonymous dataset $S_*$ obtained via suppressions, and
a flip-based 2-anonymous dataset $S_{flip}$ obtained by applying
Algorithm 1 to $S_*$. In both the solutions, the two 2-anonymous
groups are $\{S_1, S_4, S_5\}$ and $\{S_2, S_3, S_6\}$.

Now we can show the following about Algorithm 1.

Theorem 1. For a given dataset $R_D$, let the cost of a feasible
solution $S_*$ to the suppression-based k-anonymization
problem be within a factor $\alpha$ of the cost of the optimal solution.
Then the cost of $S_{flip}$ obtained by applying Algorithm 1
to $S_*$ is within a factor of $O(k\alpha)$ of the cost of the optimal
solution for the flip-based k-anonymization problem.

Proof. Let $OPT_*$ and $OPT_{flip}$ be the optimal solutions
to the suppression-based and flip-based k-anonymization problems
over $R_D$ respectively. Then it is easy to see that
$\mathrm{Cost}(OPT_*) \le (2k - 1)\,\mathrm{Cost}(OPT_{flip})$. This is because every
k-anonymous group of tuples in $OPT_{flip}$ consists of at
most 2k - 1 tuples. Further, this group can be converted
to a k-anonymous group obtained by suppressions by *ing
out any column that contains a flip (essentially the reverse
of Algorithm 1), so each flip is charged at most 2k - 1
suppressions.

It is also easy to see that the cost of any solution $S_{flip}$ obtained
by applying Algorithm 1 to a solution $S_*$ is at most
the cost of $S_*$. This gives us the following set of inequalities
and our desired result.

\[
\mathrm{Cost}(S_{flip}) \le \mathrm{Cost}(S_*) \le \alpha\,\mathrm{Cost}(OPT_*) \le \alpha(2k - 1)\,\mathrm{Cost}(OPT_{flip})
\]

□



The best possible suppression-based k-anonymization algorithm
thus gives us a good flip-based anonymization algorithm
through the application of Algorithm 1. Since the
suppression-based algorithm from [20] has an approximation
ratio of O(log k), Theorem 1 together with Proposition 1
gives us the following result.

Corollary 1. There exists an O(k log k)-approximation
algorithm for the k-anonymization problem for set-valued data.

The suppression algorithm from [20] essentially considers
all possible partitions of the dataset into k-anonymous
groups and chooses a good one using a set-cover type greedy
algorithm.

The translation of D to $R_D$ also enables the insight that
the k-anonymization problem over set-valued data is essentially
a clustering problem. Each set can be viewed as a vector
in $\{0, 1\}^m$. The optimal solution to the following clustering
problem then gives us an optimal solution to the k-anonymization
problem for set-valued data.



Definition 5. (The k-Group Clustering Problem) Given a
set of points in $\{0, 1\}^m$, cluster the points into groups of size
at least k and assign cluster centers in $\{0, 1\}^m$ so that the
sum of the Hamming distances of the points to their cluster
centers is minimized.
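The objective of Definition 5 is easy to evaluate for a given clustering. The sketch below is illustrative only: the function names and the toy points are hypothetical, and the column-wise majority center is used because it minimizes the Hamming cost of a fixed cluster when centers may be arbitrary points.

```python
def hamming(u, v):
    """Number of positions in which two equal-length tuples differ."""
    return sum(a != b for a, b in zip(u, v))


def majority_center(points, cluster):
    """Column-wise majority vote over the cluster; ties are broken toward 0."""
    size = len(cluster)
    n_cols = len(points[0])
    return tuple(1 if 2 * sum(points[i][c] for i in cluster) > size else 0
                 for c in range(n_cols))


def clustering_cost(points, clusters, centers, k):
    """Sum of Hamming distances of points to their centers (Definition 5)."""
    if any(len(cluster) < k for cluster in clusters):
        raise ValueError("every cluster must contain at least k points")
    return sum(hamming(points[i], center)
               for cluster, center in zip(clusters, centers)
               for i in cluster)


# Hypothetical points in {0,1}^3, clustered into groups of size >= 2.
points = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 1), (0, 0, 1)]
clusters = [[0, 1, 2], [3, 4]]
centers = [majority_center(points, cluster) for cluster in clusters]
print(centers)                                          # [(1, 1, 0), (0, 0, 1)]
print(clustering_cost(points, clusters, centers, k=2))  # 2
```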

The following proposition tells us that there is a one-to-one
correspondence between feasible solutions to the k-group
clustering problem and the k-anonymization problem for set-valued
data.

Proposition 3. Given a solution, $S_{group}$, to the k-group
clustering problem over a dataset $R_D$, we can obtain a solution
$S_\pm$ of the same cost to the k-anonymization problem
over D and vice versa.

Proof Sketch. For every cluster in $S_{group}$, create a k-anonymous
group of the sets corresponding to the cluster points in
$S_\pm$. k-anonymity is achieved by adding or deleting items as
necessary so that every set in the group becomes identical
to the set corresponding to the cluster center. The sum of
the Hamming distances of points to their cluster centers in
$S_{group}$ thus corresponds to the total number of additions
and deletions of items to obtain the solution $S_\pm$.

Given Proposition 3, we can now focus on solving the k- 
group clustering problem from here on. In this regard, the 
following result tells us that it suffices to consider potential 
cluster centers from amongst the data points themselves. 

Theorem 2. The cost of the optimal solution to the k-group
clustering problem when the cluster centers are chosen
from amongst the set of data points themselves is at most
twice the cost of the optimal solution to the k-group clustering
problem when the cluster centers are allowed to be
arbitrary points in $\{0, 1\}^m$.

Proof. Let OPT be the optimal solution to the k-group
clustering problem when the cluster centers are allowed to be
arbitrary points in $\{0, 1\}^m$. Now consider a solution $S_{rand}$
that maintains the same cluster groups as OPT, but replaces
each cluster center with a randomly chosen data point from
within the cluster. The expected cost of this solution is
given below.

\[
E[\mathrm{Cost}(S_{rand})] = \sum_{G \in \mathcal{G}} \sum_{C \in \mathcal{C}} \frac{2\, N_1^{G,C} N_0^{G,C}}{N_1^{G,C} + N_0^{G,C}}
\]

Here $\mathcal{G}$ is the set of all clusters in $S_{rand}$ (which is the same
as the set of clusters in OPT). $\mathcal{C}$ is the set of columns/dimensions
of the dataset $R_D$. $N_1^{G,C}$ and $N_0^{G,C}$ are the number of 1's
and number of 0's respectively that the points in a cluster
G have in column C. The cost of the optimal solution on
the other hand is given by

\[
\mathrm{Cost}(OPT) = \sum_{G \in \mathcal{G}} \sum_{C \in \mathcal{C}} \min\left(N_1^{G,C}, N_0^{G,C}\right).
\]

Since $N_1^{G,C} N_0^{G,C} = \min(N_1^{G,C}, N_0^{G,C}) \cdot \max(N_1^{G,C}, N_0^{G,C})$
and $\max(N_1^{G,C}, N_0^{G,C}) \le N_1^{G,C} + N_0^{G,C}$, each term of the first
sum is at most twice the corresponding term of the second, so

\[
E[\mathrm{Cost}(S_{rand})] \le 2\,\mathrm{Cost}(OPT).
\]

Since the expected cost of $S_{rand}$ is at most twice the cost
of OPT, there must exist some clustering solution where
the cluster centers are chosen from the data points themselves
whose cost is at most twice the cost of OPT. This
completes the proof of the theorem. □
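The bound in the proof can be checked numerically on a single cluster. The sketch below is only an illustration under stated assumptions: the helper names and the toy cluster are hypothetical, and the center is assumed to be drawn uniformly at random from the cluster's points. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def hamming(u, v):
    """Number of positions in which two equal-length tuples differ."""
    return sum(a != b for a, b in zip(u, v))


def expected_random_center_cost(cluster):
    """Expected cost when the center is a uniformly random point of the cluster."""
    total = sum(hamming(p, q) for p in cluster for q in cluster)
    return Fraction(total, len(cluster))


def optimal_center_cost(cluster):
    """Cost with the best arbitrary center: sum over columns of min(N1, N0)."""
    n = len(cluster)
    cost = 0
    for c in range(len(cluster[0])):
        ones = sum(p[c] for p in cluster)
        cost += min(ones, n - ones)
    return cost


# Hypothetical cluster of three points in {0,1}^3.
cluster = [(1, 1, 0), (1, 0, 0), (1, 1, 1)]
expected = expected_random_center_cost(cluster)
optimal = optimal_center_cost(cluster)
print(expected, optimal)        # 8/3 2
print(expected <= 2 * optimal)  # True
```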



Theorem 2 considerably simplifies the clustering problem
since there is now only a linear number of potential cluster
centers that need be considered (as opposed to $2^m$). We can
now frame this modified k-group clustering problem as an
integer program.

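\[
\begin{aligned}
\min \quad & \sum_{i,j} x_{ij} d_{ij} \\
\text{s.t.} \quad & \sum_{j} x_{ij} = 1 && \forall i \\
& x_{ij} \le y_j && \forall i, j \\
& \sum_{i} x_{ij} \ge k\, y_j && \forall j \\
& x_{ij}, y_j \in \{0, 1\} && \forall i, j
\end{aligned}
\]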

Here $y_j$ is an indicator variable that indicates whether or
not data point $S_j$ is chosen as a cluster center. $x_{ij}$ is an indicator
variable that indicates whether or not data point $S_i$
is assigned to cluster center $S_j$, and $d_{ij}$ is the Hamming distance
between data points $S_i$ and $S_j$. This integer programming
formulation is exactly equivalent to the load-balanced
facility location problem studied in [10, 13, 22]. The cluster
centers can be thought of as facilities, and the data points
as demand points. The task then is to open facilities and
assign demand points to opened facilities so that the sum of
the distances to the facilities is minimized and every facility
has at least k demand points assigned to it. The algorithms
for this problem work by solving a modified instance of a 
regular facility location problem (without the load balanc- 
ing constraints), and then grouping together facilities that 
have fewer than k demand points assigned to them. The 
result from [22] in conjunction with Theorem 2 and Propo- 
sition 3, gives us the following result. 
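To make the load-balanced facility location formulation concrete, the following sketch solves tiny instances by exhaustive search. It is not one of the approximation algorithms from [10, 13, 22]; it simply enumerates every assignment of points to data-point centers, which is only feasible for a handful of points. The function name `solve_tiny_instance` and the toy points are hypothetical.

```python
from itertools import product


def hamming(u, v):
    """Number of positions in which two equal-length tuples differ."""
    return sum(a != b for a, b in zip(u, v))


def solve_tiny_instance(points, k):
    """Brute-force the clustering with centers restricted to data points.

    assignment[i] = j means point i is assigned to center j. An index counts
    as an open facility when some point is assigned to it, and every open
    facility must serve at least k points. Runs in O(n^n) time.
    """
    n = len(points)
    best_cost, best_assignment = None, None
    for assignment in product(range(n), repeat=n):
        sizes = {}
        for j in assignment:
            sizes[j] = sizes.get(j, 0) + 1
        if any(size < k for size in sizes.values()):
            continue  # violates the load-balancing constraint
        cost = sum(hamming(points[i], points[j])
                   for i, j in enumerate(assignment))
        if best_cost is None or cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_cost, best_assignment


# Hypothetical instance with five points and k = 2.
points = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 1), (0, 0, 1)]
best_cost, best_assignment = solve_tiny_instance(points, 2)
print(best_cost)  # 2
```

On this instance the first three points share one center and the two identical points share another, so only two flips are needed.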

Theorem 3. There exists an O(1)-approximation algorithm
for the k-anonymization problem for set-valued data.

To reemphasize the earlier footnote, the approximation algorithms
for suppression-based anonymization or load-balanced
facility location never need to explicitly compute and
operate on the bit vector representations of the records.
They can operate directly on the set representations, computing
distances between pairs of sets. Algorithm 1 need
not operate on the bit-vector representations either. It can
simply take every k-group of sets and add every majority
item in the group to all the sets in the group, while deleting
other items.
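The set-based form of this majority rule can be sketched as follows. This is an illustration under stated assumptions: "majority item" is read as an item contained in strictly more than half of the sets of the group, and the function name `anonymize_group` and the toy baskets are hypothetical.

```python
from collections import Counter


def anonymize_group(group):
    """Make every set in the group equal to the set of majority items.

    Returns the common set and the number of additions plus deletions.
    """
    counts = Counter(item for record in group for item in record)
    # Keep an item only if strictly more than half of the sets contain it.
    common = {item for item, count in counts.items() if 2 * count > len(group)}
    # The symmetric difference counts the items added to or deleted from a set.
    changes = sum(len(record ^ common) for record in group)
    return common, changes


# Hypothetical group of three baskets.
group = [{"milk", "bread"}, {"milk"}, {"milk", "eggs"}]
common, changes = anonymize_group(group)
print(common)   # {'milk'}
print(changes)  # 2
```

Only items that actually occur in the group are touched, so the work is proportional to the total size of the sets rather than to the size of the universe. Note that the common set can come out empty when no item has a majority.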

5. EXPERIMENTS 

In this section we experimentally demonstrate applicabil- 
ity of our anonymization algorithms to the AOL query log 
dataset. Recall (Definition 1) that in this dataset records 
correspond to user sessions and items correspond to the 
query terms/keywords. As mentioned earlier, the query log 
dataset also contains other attributes that we ignore in this 
paper (see Section 5.5 for a discussion). Our goal then is 
to add or delete keywords from user sessions so that every 
session becomes identical to at least k - 1 others.

The anonymization algorithms from Section 4 cannot be
directly applied to the AOL dataset for several reasons: (1)
No two users in the dataset are likely to be similar on all
their queries since each user session is fairly large, representing
3 months of queries. The algorithms when directly
applied to the user sessions would thus result in a large number
of additions and deletions. (2) The dataset consists of
millions of users. The algorithms from Section 4 have a
quadratic running time and therefore cannot be practically
applied to such real-world datasets directly. (3) Different
keywords from different users could often be misspellings
of each other or derivations from a common stem. The conditions
for considering two user sessions to be "identical"
thus need to be relaxed.

We describe below the steps we took to overcome these
three problems.

5.1 Separating User Sessions into Threads 

To deal with the issue of large user sessions, we consid- 
ered a relaxed problem definition: Each user session was 
first divided into smaller threads and a different random 
identifier was assigned to each thread. We then considered 
the anonymization problem over these threads instead of the 
original sessions. Each user thread was treated as a set of 
keywords and our goal was to add or delete keywords from 
user threads so that every user thread became identical to 
threads from at least k - 1 other users.

One trivial way to divide sessions into threads is to treat 
every single query from a user as a thread of its own and as- 
sign a random identifier to it. However this would render the 
data nearly useless for many forms of analysis (e.g., study- 
ing patterns of query refinement). Instead "topic-based" 
threads were determined on the basis of the similarity of 
constituent queries. For this purpose we employed two simple
measures to determine query similarity:

• Edit distance: Two queries were deemed similar if the 
edit distance between them was less than a threshold. 

• Overlapping result sets: Two queries were deemed sim- 
ilar if the result sets returned for each query by a search 
engine had a large overlap in the top 50 results. 

Using these similarity measures, each user session was sep- 
arated into multiple threads: Queries in a user session were 
considered in the order of their time stamps. A query that 
was similar to one seen before was assigned the same iden- 
tifier as the previous query. A query that was very differ- 
ent from any of the previously seen queries was assigned a 
new identifier. This was followed by another round where 
consecutive threads that contained similar queries were collapsed,
and so on. This algorithm for determining threads
was run on a random sample of ~82K users who posed a total
of ~412K queries. The 82K user sessions were split into
~165K threads. Each thread had on average 2.55 unique
keywords.
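The first pass of the thread-separation procedure can be sketched as follows. This is a simplified illustration, not the procedure used in the experiments: it uses only the edit-distance measure, omits the later rounds that collapse consecutive threads, and the threshold value, function names and sample queries are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def split_into_threads(queries, threshold):
    """Assign each query, in time-stamp order, to the first thread that
    contains a similar earlier query, or open a new thread otherwise."""
    threads = []
    for query in queries:
        for thread in threads:
            if any(edit_distance(query, seen) < threshold for seen in thread):
                thread.append(query)
                break
        else:
            threads.append([query])
    return threads


# Hypothetical session, already sorted by time stamp.
session = ["cheap flights", "cheap flight", "pizza recipe",
           "cheap flights nyc", "pizza recipes"]
for thread in split_into_threads(session, threshold=5):
    print(thread)
# ['cheap flights', 'cheap flight', 'cheap flights nyc']
# ['pizza recipe', 'pizza recipes']
```

In a release, each resulting thread would then receive its own random identifier.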

There may of course exist more sophisticated techniques
for separating sessions into topic-based threads; however, this
is not the focus of this paper. Note that the shift in goal from
anonymizing sessions to anonymizing threads enhances the
utility of the released dataset (anonymizing entire sessions
would require far too many additions and deletions), without
affecting privacy too much. In fact, as we shall see
in Section 5.4, the separation into threads itself helps in
anonymization.

5.2 Pre-clustering User Threads 

As mentioned earlier, the algorithms from Section 4 have a
quadratic running time, and cannot be practically applied to
our dataset of user threads. To make them more scalable,
we first performed a preliminary clustering step where we
clustered similar user threads together using a simple, fast
clustering algorithm, and then applied the k-anonymization
algorithms from Section 4 to the threads within each cluster.
If a cluster had fewer than k user threads, we simply deleted
these threads altogether. Running the k-anonymization algorithms
within these small clusters was much more efficient
than running them directly on all the user threads at once.

To do the preliminary clustering, we used the Jaccard coefficient
as a similarity measure for user threads. Recall
that each thread $S_i$ is a subset of the universe of keywords
$U = \{e_1, \ldots, e_m\}$. Under the Jaccard measure, the similarity
of two user threads $S_i$ and $S_j$ is given by

\[
\mathrm{Sim}(S_i, S_j) = \frac{|S_i \cap S_j|}{|S_i \cup S_j|}.
\]
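As a small illustration, the Jaccard coefficient of two threads is the size of their intersection divided by the size of their union. The function name `jaccard`, the toy threads, and the convention for two empty sets are assumptions made for this sketch.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets: |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets are identical
    return len(a & b) / len(union)


# Hypothetical keyword threads.
thread_1 = {"cheap", "flights"}
thread_2 = {"cheap", "flights", "nyc"}
thread_3 = {"pizza", "recipe"}

print(jaccard(thread_1, thread_2))  # two shared keywords out of three
print(jaccard(thread_1, thread_3))  # 0.0, no shared keywords
```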



A straightforward clustering algorithm would involve a
comparison between every pair of user threads and would
thus be very inefficient. Instead, to quickly cluster all the
user threads, we used Locality Sensitive Hashing (LSH). The
LSH technique was introduced in [12] to efficiently solve the
nearest-neighbour search problem. The key idea is to hash
each user thread using several different hash functions, ensuring
that for each function, the probability of collision is
much higher for threads that are similar to each other than
for those that are different. The Jaccard coefficient as a similarity
measure admits an LSH scheme called Min-Hashing [8,
7].

The basic idea in the Min-Hashing scheme is to randomly
permute the universe of keywords $U$, and for each user thread
$S_i$, compute its hash value $MH(S_i)$ as the index of the first
item under the permutation that belongs to $S_i$. It can be
shown [8, 7] that for a random permutation the probability
that two user threads have the same hash value is exactly
equal to their Jaccard coefficient. Thus Min-Hashing is a
probabilistic clustering algorithm, where each hash bucket
corresponds to a cluster that puts together two user threads
with probability proportional to their Jaccard coefficient.
The LSH algorithm [12] concatenates p hash-keys for users
so that the probability that any two users $S_i$ and $S_j$ agree on
their concatenated hash-keys is equal to $\mathrm{Sim}(S_i, S_j)^p$. The
concatenation of hash-keys thus creates refined clusters with
high precision. Typical values for p that we tried were in the
range 2 to 4.
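
The collision property can be checked empirically on a toy universe. The sketch below uses explicit random permutations, which is feasible only because the universe here has six keywords; the keyword sets, the function names and the trial count are illustrative assumptions. It compares the observed rate at which two threads agree on a concatenated key of p Min-Hash values with the p-th power of their Jaccard coefficient.

```python
import random

def minhash(thread, rank):
    """Index of the first keyword of `thread` under the permutation `rank`."""
    return min(rank[w] for w in thread)

def lsh_key(thread, ranks):
    """Concatenate one Min-Hash value per permutation into a bucket key."""
    return tuple(minhash(thread, r) for r in ranks)

def random_ranks(universe, p, rng):
    """Draw p independent random permutations as keyword -> position maps."""
    ranks = []
    for _ in range(p):
        order = list(universe)
        rng.shuffle(order)
        ranks.append({w: i for i, w in enumerate(order)})
    return ranks

universe = sorted({"pine", "straw", "mulch", "lilburn", "delivery", "nicotine"})
s_i = {"pine", "straw", "lilburn", "delivery"}
s_j = {"pine", "straw", "mulch"}

rng = random.Random(0)
trials, p = 20000, 2
hits = 0
for _ in range(trials):
    ranks = random_ranks(universe, p, rng)
    hits += lsh_key(s_i, ranks) == lsh_key(s_j, ranks)

sim = len(s_i & s_j) / len(s_i | s_j)
print(f"empirical collision rate {hits / trials:.3f} vs Sim^p {sim ** p:.3f}")
```

Here the Jaccard coefficient is 2/5, so with p = 2 the two threads should share a bucket in roughly 16% of the trials; raising p makes buckets purer at the cost of splitting similar threads more often.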

Clearly, generating random permutations over the universe
of keywords and storing them to compute Min-Hash values is
not feasible. So instead, we generated a set of p independent,
random seed values, one for each Min-Hash function, and
mapped each user thread to a hash-value computed using the
seed. This hash-value serves as a proxy for the index in the
random permutation. The approximate Min-Hash values
thus computed have properties similar to the ideal Min-Hash
value [11]. See [11] for more details on this technique.
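
One way to realize this idea is sketched below. The paper does not say which hash function was used, so the choice of a seeded BLAKE2b digest, the function names and the seed generation are all assumptions made for illustration. Each seed defines a pseudo-random ordering of keywords through their hash values, and the smallest hash value in a thread plays the role of the Min-Hash.

```python
import hashlib
import random

def seeded_hash(keyword, seed):
    """64-bit hash of a keyword under a given seed (stand-in for a permutation index)."""
    data = f"{seed}:{keyword}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def approx_minhash(thread, seed):
    """Smallest seeded hash over the keywords of a thread."""
    return min(seeded_hash(w, seed) for w in thread)

def lsh_key(thread, seeds):
    """Concatenation of one approximate Min-Hash value per seed."""
    return tuple(approx_minhash(thread, s) for s in seeds)

# p independent random seeds, one per Min-Hash function (p = 3 here).
seed_rng = random.Random(42)
seeds = [seed_rng.getrandbits(64) for _ in range(3)]

s_i = {"pine", "straw", "lilburn", "delivery"}
s_j = {"pine", "straw", "mulch"}
print(lsh_key(s_i, seeds) == lsh_key(s_j, seeds))
```

Only the p seeds need to be stored, regardless of how large the keyword universe is.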

As a result of running the LSH-based clustering algorithm
on our user threads, we obtained a total of ~84K clusters.
Each cluster contained an average of 2 user threads. The
largest cluster contained ~2800 threads and corresponded
to the queries that searched for 'Google'!

Again, there may exist more sophisticated techniques for 
clustering similar user threads together, however this is not 
the focus of this paper, which is meant to be more of a proof 
of concept. 
Figure 4: Cost of achieving k-anonymity



5.3 k-Anonymity within Clusters

Now within each cluster generated using the LSH scheme
above, we ran the k-anonymization algorithm from Section 4
(i.e., the suppression algorithm from [20] followed by the
application of Algorithm 1).

Before proceeding further, we need to clarify the criterion
that was used for deeming two user threads to be identical.
As mentioned earlier, different user threads might contain
keywords that are actually just misspellings of each other or
derivations from a common stem. To deal with this issue,
we once again resorted to LSH. We treated each user thread
as a set of Locality Sensitive Hashes [8, 7] of its constituent
keywords, i.e., a user thread $S_i = \{e_1, \ldots, e_\ell\}$ now became
$S_i = \{LSH(e_1), \ldots, LSH(e_\ell)\}$, where $LSH(e_j)$ is a
concatenation of Min-Hashes of the keyword $e_j$^. Two user threads
were considered identical if they had the same set of hashes.
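
A rough sketch of this keyword-level hashing is given below. It assumes, following the footnote, that a keyword is treated as a multiset of characters; encoding the multiset as (character, occurrence index) pairs, the seeded BLAKE2b hash and all function names are illustrative assumptions rather than the paper's actual implementation. Keywords are assumed to be non-empty.

```python
import hashlib
from collections import Counter

def char_multiset(keyword):
    """Encode a keyword's multiset of characters as a set of (char, index) pairs."""
    counts = Counter(keyword)
    return {(ch, i) for ch, n in counts.items() for i in range(n)}

def seeded_hash(item, seed):
    """64-bit seeded hash of one (char, index) pair."""
    data = f"{seed}:{item[0]}:{item[1]}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def keyword_lsh(keyword, seeds):
    """Concatenation of Min-Hashes of the keyword's character multiset."""
    elems = char_multiset(keyword)
    return tuple(min(seeded_hash(e, s) for e in elems) for s in seeds)

def thread_signature(thread, seeds):
    """A thread becomes the set of LSH values of its keywords."""
    return frozenset(keyword_lsh(w, seeds) for w in thread)

seeds = [11, 23]  # p = 2 arbitrary seeds
a, b = char_multiset("seffects"), char_multiset("effects")
# 7 shared elements out of 8, so the keyword-level Jaccard coefficient is 7/8.
print(len(a & b), len(a | b))
print(thread_signature({"effects", "nicotine"}, seeds)
      == thread_signature({"nicotine", "effects"}, seeds))
```

Under this encoding the misspelling 'seffects' and the keyword 'effects' have Jaccard coefficient 7/8, so with p = 2 they would receive the same LSH value with probability 49/64 over the choice of hash functions. Anagrams always collide, since the multiset view ignores character order.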

Now if a k-anonymous solution for a particular cluster
deemed that a certain LSH value must be deleted from a
particular user thread, we simply deleted all the keywords
from the user thread that generated that LSH value. If the
solution asked for an LSH value to be added to a user thread,
we added to the thread one of the keywords from its cluster
that generated the LSH value. Threads in clusters of size
less than k were entirely deleted.
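
The translation from LSH-level edits back to keywords can be sketched as follows. The function apply_edits, the stand-in hash toy_lsh (sorted characters of the keyword) and the rule of picking the alphabetically first candidate keyword are assumptions for illustration; the text only says that one of the cluster's keywords generating the LSH value is added.

```python
def toy_lsh(keyword):
    """Stand-in for the keyword LSH: the keyword's characters in sorted order."""
    return "".join(sorted(keyword))

def apply_edits(thread, cluster_keywords, lsh, to_delete, to_add):
    """Apply LSH-level deletions and additions to a thread of keywords.

    to_delete / to_add are sets of LSH values chosen by the anonymization step.
    """
    # Deleting an LSH value removes every keyword of the thread that maps to it.
    result = {w for w in thread if lsh(w) not in to_delete}
    # Adding an LSH value inserts one keyword from the cluster that maps to it.
    for value in to_add:
        candidates = sorted(w for w in cluster_keywords if lsh(w) == value)
        if candidates:
            result.add(candidates[0])  # assumed tie-breaking rule
    return result

thread = {"pine", "straw", "lilburn", "delivery"}
cluster_keywords = {"pine", "straw", "mulch"}
edited = apply_edits(
    thread,
    cluster_keywords,
    toy_lsh,
    to_delete={toy_lsh("lilburn"), toy_lsh("delivery")},
    to_add={toy_lsh("mulch")},
)
print(sorted(edited))  # ['mulch', 'pine', 'straw']
```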

Figure 4 shows the total number of additions and dele- 
tions of keywords that were made for different values of k. 
As would be expected, as k increases, the total number of 
additions and deletions that need to be made to achieve k- 
anonymity increases. The number of additions is a small 
fraction of the total cost, and surprisingly goes down as k 
increases. 

5.4 Case Study 

As anecdotal evidence of the effectiveness of our algorithms
in anonymizing query logs, we looked at the query
logs of user 4417749, who had previously been identified
as Ms. Thelma Arnold from Lilburn, Georgia.

Figure 5(a) shows a sample of user 4417749's query logs. 
Misspellings have been maintained, however repeated queries 
have been removed. As can be seen, the user searched for 



4417749 pine straw lilburn delivery
4417749 pine straw delivery in gwinnett county
4417749 pine straw in lilburn ga.
4417749 atlant humane society
4417749 atlanta humane society
4417749 dekalb animal shelter
4417749 dekalb humane society
4417749 gwinnett animal shelter
4417749 doraville animal shelter
4417749 humane society
4417749 gwinnett humane society
4417749 seffects of nicotine
4417749 effects of nicotine
4417749 nicotine effects on the body
4417749 jarrett arnold
4417749 jarrett t. arnold
4417749 jarrett t. arnold eugene oregon
4417749 eugene oregon jaylene arnold
4417749 jaylene and jarrett arnold eugene or.

(a) User 4417749's Session



1 4417749 pine straw [lilburn] [delivery]
1 4417749 pine straw [delivery] in [gwinnett] [county]
1 4417749 pine straw in [lilburn] [ga.]
  (thread 1: keyword +mulch added to one of the queries)
2 4417749 atlant humane society +county
2 4417749 atlanta humane society
2 4417749 dekalb animal shelter
2 4417749 dekalb humane society
2 4417749 [gwinnett] animal shelter
2 4417749 [doraville] animal shelter
2 4417749 humane society
2 4417749 [gwinnett] humane society
3 4417749 seffects of nicotine
3 4417749 effects of nicotine
3 4417749 nicotine effects on the body
4 4417749 [jarrett arnold]
4 4417749 [jarrett t. arnold]
4 4417749 [jarrett t. arnold eugene oregon]
4 4417749 [eugene oregon jaylene arnold]
4 4417749 [jaylene and jarrett arnold eugene or.]

(b) User 4417749's anonymized threads (the first column is the
thread identifier; deleted keywords are shown in [brackets],
added keywords are prefixed with +)

Figure 5: User 4417749's Query Logs



^Each keyword can be treated as a multiset of characters 



some fairly generic queries such as the "effects of nicotine
on the body". However, she also posed several identifying
queries. For instance, she queried for humane societies and
animal shelters in Gwinnett county, Georgia, revealing herself
to be an animal lover in Gwinnett county. Further, she
queried for pine straw delivery in Lilburn, Gwinnett, thereby
revealing herself to be a resident of Lilburn, Gwinnett.
Finally, her queries for relatives in Oregon revealed that her
last name was "Arnold".

Figure 5(b) shows the result of running our k-anonymization
algorithm for k = 3. Notice first that the division of
Ms. Arnold's session into threads itself goes some way in
anonymization by de-correlating her various query topics.
The session sample was divided into a thread for pine straw
delivery, a thread for animal shelters and humane societies, a
thread for the effects of nicotine and a thread for the queries
about relatives in Oregon. Each thread was assigned a
separate identifier.

The threads were treated as sets of unique keywords (not
depicted in the figure) and were then clustered with the
threads of other users using LSH. The anonymization
algorithms were run within the resulting clusters. If a
particular keyword was to be deleted from a particular thread,
we deleted every occurrence of that keyword from the original
queries of the thread. If a keyword was to be added to
a thread, we added it to one of the original queries of the
thread. The result was that some threads, such as the
nicotine thread, were left relatively untouched. In the thread
for pine straw delivery, the keywords 'lilburn', 'delivery',
'gwinnett', 'county' and 'ga.' were deleted, and the keyword
'mulch' was added instead. This is because other users in
the thread's cluster, querying for 'pine straw', queried for
it in conjunction with the keyword 'mulch'. Similarly, in
the thread for animal shelters and humane societies, the
keywords 'gwinnett' and 'doraville' were removed, while the
keyword 'county' was added since many users searched for
animal shelters in 'dekalb county'. Finally, the thread for
the relatives in Oregon was deleted altogether because too
few threads from other users were clustered with it. Many
users queried for 'arnold schwarzenegger'; however, none of
their threads fell in the same cluster!

This example shows that our algorithm does the intuitively
right thing. Identifying keywords are removed and
keywords that commonly occur in conjunction with other
keywords are added to a user's threads. The guarantee is
that every user thread will look like the threads of at least
k - 1 other users, and this guarantee is achieved while making
a close to minimal number of additions and deletions.
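
Both the guarantee and the cost being minimized are easy to state in code. The sketch below is illustrative only: the function names and the toy threads are assumptions, threads are compared as plain keyword sets, and the check counts identical threads without distinguishing which user they belong to.

```python
from collections import Counter

def is_k_anonymous(threads, k):
    """True if every thread is identical to at least k - 1 other threads."""
    counts = Counter(frozenset(t) for t in threads)
    return all(c >= k for c in counts.values())

def edit_cost(original, anonymized):
    """Total number of keyword additions and deletions across all threads."""
    return sum(len(set(o) ^ set(a)) for o, a in zip(original, anonymized))

original = [
    {"pine", "straw", "lilburn", "delivery"},
    {"pine", "straw", "mulch"},
    {"pine", "straw", "mulch", "bulk"},
]
anonymized = [{"pine", "straw", "mulch"} for _ in original]

print(is_k_anonymous(original, 3))      # False
print(is_k_anonymous(anonymized, 3))    # True
print(edit_cost(original, anonymized))  # 4: two deletions and one addition, then one deletion
```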

5.5 Discussion 

While the example of Ms. Thelma Arnold seems to indi- 
cate that our anonymization algorithms do the right thing 
for query logs, our experimental work here is in reality a 
first step due to the complex nature of the dataset. Several 
points require further discussion. 

Other Attributes: As mentioned earlier, query logs 
contain other information about user activity, namely time 
stamp information for when a query was posed and the 
query result that was clicked on. Our algorithm focussed 
on anonymizing just the queries themselves, whereas it is 
conceivable that these other attributes of the dataset may
also be used in launching privacy attacks. One possible
anonymization approach is to treat these other attribute val- 
ues as items of the universe as well and proceed as before. 
So for example, if a majority of users queried for the Indiana 
Jones movie on the day that it was released, then this day 
would be added as part of the time stamp to all user threads 
on the Indiana Jones movie. The drawback to this approach 
could be a loss of very fine-grained time stamp information 
and a better understanding of utility is required before this 
approach can be recommended. 

Privacy: In adapting our algorithm to the query logs, we 
considered a relaxation of the original problem statement: 
instead of anonymizing entire sessions, we anonymized threads. 
The privacy implications of this relaxation need to be fur- 
ther examined. At first glance, it seems that the division 
of the user sessions into threads only helps in our privacy 
goals by de-correlating a user's query topics. However there 
is no "proof" that a user's threads could not somehow be 
stitched together to reconstruct his session, which would 
then no longer be k-anonymous. An experimental or
theoretical study of the implications of our problem relaxation
would be an interesting avenue for future work.

Utility: Our approach of treating a thread as a set of
keywords affects the utility of the released dataset. For
example, in Figure 5(b), the keyword 'county' was added to
the query for 'atlant humane society' since it was to be
indiscriminately added to any one of the queries in the thread.
In reality it should have been added to the query for 'dekalb
animal shelter', and in the semantically correct position, as
'dekalb county animal shelter'. Thus by treating
threads as sets of keywords, we lose potentially important
information about the ordering of keywords within queries.
Another point regarding utility is our criterion for measuring
the utility of the released dataset. As in traditional
k-anonymity work, the criterion we used was to minimize the
total number of changes made to the dataset. A better metric
for measuring the utility of the released dataset would be
to measure the impact of the anonymization on algorithms
that actually use the dataset. For example, how well does a
search engine's query suggestion algorithm work when run
on the released dataset instead of the original? This is a
very interesting question that would ultimately need to be
answered in order to evaluate the utility of any anonymization
scheme.

6. SUMMARY AND FUTURE WORK 

In this paper we introduced the k-anonymization problem
for set-valued data. Algorithms with approximation factors
of O(k log k) and O(1) for the problem were developed. We
applied our anonymization algorithms to the AOL query log
dataset. In order to scale the algorithms to deal with the
size of the dataset, we proposed a division of the dataset
into clusters, followed by the application of anonymization
algorithms within the clusters. Besides the problems
mentioned in Section 5.5, there are several other avenues for
future work. For instance, one interesting research direction
would be to develop scalable anonymization algorithms for
massive modern-day datasets with provable approximation
guarantees. Another important research question is how
such algorithms can be applied to anonymize datasets on
the fly as new records get added to them. For example, as a
search engine receives new queries, how should it anonymize
them in an online fashion before storing them?